Abstract: We design and analyze gossip algorithms for networks
with correlated data. In these networks, either the data to
be distributed, the data already available at the nodes, or both, 
are correlated. Although coding schemes for correlated data have 
been studied extensively, the focus has been on characterizing 
the rate region in static memory-free networks. In a gossip- 
based scheme, however, nodes communicate among each other by 
continuously exchanging packets according to some underlying 
communication model. The main figure of merit in this setting is 
the stopping time - the time required until nodes can successfully 
decode. While gossip schemes are practical, distributed and
scalable, they have only been studied for uncorrelated data.

We wish to close this gap by providing techniques to analyze 
network coded gossip in (dynamic) networks with correlated data. 
We give a clean framework for oblivious network models that 
applies to a multitude of network and communication scenarios, 
specify a general setting for distributed correlated data, and give 
tight bounds on the stopping times of network coded protocols 
in this wide range of scenarios. 

I. Introduction 

In this paper, we design and analyze information dissemi- 
nation algorithms in communication networks with correlated 
data. In these networks, either the data to be distributed, the 
data already available at the nodes, or both, are correlated. 
This problem arises in many networking applications, such
as sensor, peer-to-peer or content distribution networks. One
such example is a large set of distributed temperature sensors
with a clock at the receiver. Both the temperatures at different
sensor locations and the times at which measurements are taken
are highly correlated with one another.

While the current information theory literature includes 
several coding schemes for correlated data, the focus in these 
works is mainly on characterizing the rate region - the set of 
achievable rates. On the other hand, recent work in the networking
literature offers a multitude of efficient, decentralized
and address-oblivious schemes for information dissemination 
(e.g., randomized gossip). Unfortunately these schemes treat 
the data as uncorrelated and neglect any available information 
at the receivers. The focus of this paper is to close this gap and 
give tools for analyzing gossip-based algorithms in networks 
with correlated data. 

A. Related Work 

Distributed Source Coding: Distributed compression has 
been studied in information theory mainly through small, 
canonical problems. In [1], Slepian and Wolf considered the
problem of separately encoding two correlated sources and
joint decoding. In [2] and [3] the problem of compression with
a rate-limited helper is considered. In [4], Ho et al. considered
the multicast problem with correlated sources, which can be
viewed as extending the Slepian-Wolf problem to arbitrary
networks through network coding. Further extensions appeared
in [5] and [6]. In all these studies the goal is to characterize
the rate region for (fixed) static and memory-free networks,
that is, the set of required capacities needed for a multicast.

(Network Coded) Gossip: Gossip schemes were first introduced
in [7] as a simple and decentralized way to disseminate
a piece of information in a network. A detailed
analysis of a class of these algorithms is given in [8]. In
these schemes nodes communicate by continuously picking
communication partners in a randomized fashion and then
forwarding the information. The main figure of merit is the
stopping time - the time needed for all nodes to be informed.
Such randomized gossip-based protocols are attractive due to
their locality, simplicity, and structure-free nature, and have
been proposed in the literature for various tasks. For the task
of distributing multiple messages, [9] introduced algebraic
gossip, a network coding-based gossip protocol in which nodes
exchange linear combinations of their available messages. This
idea was extended to arbitrary networks in [10] and [11].
Haeupler [12] proved tight bounds for the stopping time of
algebraic gossip for various models, including (adversarial)
dynamically changing networks [13] and nodes with limited
memory [14]. Improved bounds for non-uniform gossip were
given in [15]. The projection analysis developed in [12] will
play a key role in this paper.

B. Our Contributions 

To our knowledge this paper is the first to combine these
two strands of research and analyze gossip-based protocols in
networks with correlated data. Our contributions are manifold:
First, we give a clean and general framework for oblivious
network models in Section III and define a setting for a correlated
environment in Section II. In this general setting we extend the
projection analysis of [12] by making a connection between
the coefficient vectors a node knows and the amount and type
of information it has learned. This results in simple, direct and
self-contained proofs of tight bounds on the stopping time in
the canonical models of one source and side information at the
receivers, as well as two correlated sources. In Section V we
then give results for the general scenario of multiple sources
and side information. We do this by providing tight bounds
on the time required for any set of (fractional) capacities
to be induced by the (random) packet exchanges generated
in an oblivious network model. This allows us to transform
results on the rate region of static memory-free networks (e.g.,
[4], [6]) into bounds on the stopping times of gossip-based
algorithms. These capacity bounds are interesting in their own
right and have the potential to be useful in other information
dissemination problems.

II. Network and Information Model 

In this section we state the broadcast problem, define the
network and communication model, and specify the information
model describing the nature of the source and side information.

1) Network and Communication Model: For simplicity, we
will assume that the network consists of a fixed set $V$ of $n$
nodes. Communication takes place in synchronous rounds. In
each round, each node $v$ decides on a packet $p_v$ of $s$ bits to be
sent out (possibly using randomness). Given the current state
of the network, the network model then specifies (possibly
using randomness) which packet will be received by which
node in the current round. This corresponds to a probability
distribution over directed edge sets, where a directed edge
$(v,u)$ means that the packet $p_v$ is successfully received by
node $u$. Nodes are assumed to have unlimited storage (for
schemes with limited buffers see [14]). We denote the set of
directed edges chosen for round $t$ by $E_t$ and call it the active
edge set for time $t$.

2) Source and Side Information: We assume $k$ messages
are initially distributed over the network. The $i$-th message
consists of $l$ i.i.d. samples from the random variable $X_i$,
namely, a vector $\mathbf{x}_i$. The message vectors $\mathbf{x}_1, \ldots, \mathbf{x}_k$ are
initially distributed to nodes (i.e., sources) such that each
vector is given to at least one node. We also assume that
each node $v \in V$ (or terminal) has some side information.
The side information $\mathbf{y}_v$ at node $v \in V$ is drawn as $l$ i.i.d.
samples from $Y_v$. Note that the variables $\{X_i\}_{i=1}^{k} \cup \{Y_v\}_{v \in V}$
are arbitrarily correlated according to some known memoryless
joint distribution. We are interested in the time when nodes are
able to decode $\mathbf{x}_1, \ldots, \mathbf{x}_k$ based on their side information and
the packets exchanged with other nodes.

3) The Encoding and Decoding Schemes: For a given field
size $q$ and slack $\delta > 0$ we assume throughout that nodes
employ the following coding scheme: Prior to communication,
source nodes perform random binning, that is, for every
$1 \le i \le k$ each node receiving the message vector $\mathbf{x}_i$ applies the
same random mapping into $2^{l(H(X_i)+\delta)}$ bins. The resulting bin
indices (which are the same for every node initialized with $\mathbf{x}_i$)
are interpreted as vectors of length $h = \frac{l}{\log q}(H(X_i)+\delta)$
over the finite field $\mathbb{F}_q$. These vectors are then split into
$\lceil \frac{l}{s}(H(X_i)+\delta) \rceil$ blocks of $\frac{s}{\log q}$ symbols in $\mathbb{F}_q$ each,
for a total of $s$ bits per block.

During the communication phase nodes send out random
linear combinations (over $\mathbb{F}_q$) of these blocks as packets. To
keep track of the linear combination contained in a packet,
one coefficient for each block of each message is introduced
and sent in the header of each packet. As in all prior works
on distributed network coding (e.g., [4], [10], [11], [12], [14]),
we assume that $s$ is sufficiently large compared to
the number of coefficients. This renders the overhead of the
header negligible, leaving a packet size of $s$ bits as desired.

At each node, independent linear equations on the blocks
are collected for decoding. We denote by $S_v$ the subspace
spanned by all coefficient vectors received at node $v$. We also
use the following notion of knowledge from [12]:

Definition 1. A node $v$ knows a coefficient vector $\mu$ iff $S_v$ is
not orthogonal to $\mu$, that is, there exists a vector $c \in S_v$ such
that $\langle c, \mu \rangle \neq 0$.
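
As an illustration of Definition 1, the following minimal sketch checks whether a node knows a given coefficient vector and computes the rank of the vectors it has received. It assumes a prime field size $q$ (so that arithmetic modulo $q$ is field arithmetic, whereas the text allows prime powers); the function names `inner`, `knows` and `rank_mod_q` are hypothetical and chosen only for this example.

```python
def inner(c, mu, q):
    """Standard inner product of two vectors over F_q (q prime)."""
    return sum(a * b for a, b in zip(c, mu)) % q


def knows(received, mu, q):
    """True iff span(received) is not orthogonal to mu over F_q (q prime)."""
    # The span is orthogonal to mu exactly when every generator is.
    return any(inner(c, mu, q) != 0 for c in received)


def rank_mod_q(vectors, q):
    """Rank of a list of vectors over F_q (q prime) via Gaussian elimination."""
    rows = [[x % q for x in v] for v in vectors]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, q)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % q for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % q for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank
```

For example, with $q = 3$ a node that has received only the vector $(1,2,0)$ has rank one, knows $(1,0,0)$, and does not know $(1,1,0)$, since $1 + 2 = 0$ in $\mathbb{F}_3$.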

Lastly, we will make use of the following lemma on random 
binning. 

Lemma 1 ([6], [1]). Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be two
arbitrarily correlated random variables and let $\mathbf{x}$, $\mathbf{y}$ be two
vectors that are created by taking $l$ i.i.d. samples from their
joint distribution. Suppose, for some $\delta > 0$, all possible
sequences in $\mathcal{X}^l$ are randomly and uniformly divided into at least
$2^{l(H(X|Y)+\delta)}$ bins. Then joint typicality decoding correctly
decodes $\mathbf{x}$ with high probability (as $l \to \infty$) using $\mathbf{y}$ and
any $\lceil (H(X|Y)+\delta) l \rceil$ bits of information on the bin index of
the bin in which the true $\mathbf{x}$ resides.

In particular, Lemma 1 asserts that having access to the side
information vector $\mathbf{y}$, the message vector $\mathbf{x}$ can be decoded
with high probability using any $\lceil \frac{l}{s}(H(X|Y)+\delta) \rceil$ linearly
independent equations on the blocks describing the bin index
of $\mathbf{x}$.
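
To make the quantity $\lceil \frac{l}{s}(H(X|Y)+\delta) \rceil$ concrete, the sketch below computes the conditional entropy of a finite joint distribution and the resulting number of independent equations. The representation of the joint distribution as a dictionary and the names `conditional_entropy` and `blocks_needed` are assumptions made for this example only; entropies are measured in bits.

```python
import math
from collections import defaultdict


def conditional_entropy(joint):
    """H(X|Y) in bits for a joint pmf given as {(x, y): probability}."""
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_y[y] += p
    return -sum(
        p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0
    )


def blocks_needed(l, s, h_cond, delta):
    """Number of independent s-bit equations: ceil((l / s) * (H(X|Y) + delta))."""
    return math.ceil(l * (h_cond + delta) / s)
```

For instance, with block length $l = 1000$, packet size $s = 100$, $H(X|Y) = 0.5$ and $\delta = 0.25$, `blocks_needed` returns 8, compared with 13 equations for the same $\delta$ when no side information is available and $H(X) = 1$.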

III. Oblivious Network Models 

In this section we introduce the definition of an oblivious 
network model. This gives a clean and very general framework 
capturing a wide variety of communication and (dynamic) 
network settings. While this was already somewhat implicit
in [12], we restrict ourselves to networks without adaptive
adversarial behavior. This greatly facilitates the much cleaner
framework presented in this section.

Definition 2. A network model is oblivious if the active edge
set $E_t$ of time $t$ only depends on $t$, $E_{t'}$ for any $t' < t$, and some
randomness. We call an oblivious network model furthermore
i.i.d. if the active edge set $E_t$ is sampled independently for
every time $t$ from a distribution over edge sets.

The following are examples of oblivious (and i.i.d.) network
models: In the (Uniform) Gossip Model [10], [11], [12] one
has an underlying (directed) graph $G$ and in each round each
node picks a random neighbor as a communication partner. A
node then sends a packet to (PUSH) or requests a packet from
(PULL) its partner, or both (EXCHANGE). The Rumor Mongering
or Random Phone Calls Model [8] is a well-studied special
case of this in which $G$ is complete, that is, nodes pick a
random node as a communication partner. It is easy to include
more sophisticated features in an oblivious network model.
Random packet losses in wired networks, or characteristics
of radio networks like half-duplex transmission, collisions,
packet loss rates depending on SNR and more can be easily
modeled by removing edges according to (randomized) rules.
An interesting example of non-i.i.d. oblivious network models
are (edge-)Markovian evolving graphs [16].
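
The i.i.d. examples above can be sketched in a few lines. The code below samples one active edge set for uniform gossip with PUSH and then removes edges at random to model packet loss. The dictionary-of-neighbor-lists representation and the names `sample_push_edges`, `drop_edges` and `complete_graph` are assumptions made for illustration, not part of the model definition.

```python
import random


def sample_push_edges(neighbors, rng):
    """Sample one active edge set for uniform gossip with PUSH.

    neighbors maps each node to a non-empty list of its neighbors in G.
    A directed edge (v, u) means the packet of v is received by u.
    """
    return {(v, rng.choice(nbrs)) for v, nbrs in neighbors.items()}


def drop_edges(edges, loss_prob, rng):
    """Model random packet loss: delete each edge independently."""
    return {e for e in edges if rng.random() >= loss_prob}


def complete_graph(n):
    """Neighbor lists of the complete graph (random phone call model)."""
    return {v: [u for u in range(n) if u != v] for v in range(n)}
```

Since each call draws fresh randomness and ignores everything except the fixed graph, repeated calls produce an i.i.d. sequence of edge sets in the sense of Definition 2.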

For any oblivious network model $M$ we can define a
random flooding process $F(M,p,S)$. Informally, this process
describes which nodes are informed over time if initially only
nodes in $S$ are informed and from there on every informed
node informs all its communication partners (as specified by
$M$). The only modification to this standard infection process
is the parameter $p$, which gives each transmission an
independent probability $p$ of failing.

Definition 3. Let $M$ be an oblivious network model, $p$ be
a probability of fault and $S \subseteq V$ be a starting set of
nodes. We define the flooding process $F(M,p,S)$ to be the
random process $S_1 \subseteq S_2 \subseteq \ldots$ that is characterized by
$S_1 = S$ and, for every time $t$, $S_{t+1}$ is obtained by taking
each of the (directed) edges $E_t$ specified by $M$ for time $t$
independently with probability $1 - p$ to obtain $E'_t$ and then
setting $S_{t+1} = \{v \in V \mid \exists u \in S_t : (u,v) \in E'_t \lor v = u\}$.
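
The flooding process of Definition 3 is easy to simulate. The sketch below does so for one concrete choice of $M$, the random phone call model with PUSH on $n$ nodes; that choice, the `max_rounds` cutoff and the function name `flooding_time` are assumptions made for this example.

```python
import random


def flooding_time(n, p, start, rng, max_rounds=10_000):
    """Simulate F(M, p, S) for the random phone call PUSH model on n nodes.

    Returns the first t with S_t = V (where S_1 = start), or None if that
    does not happen within max_rounds rounds.
    """
    informed = set(start)
    t = 1
    while len(informed) < n:
        if t > max_rounds:
            return None
        # Active edge set E_t: every node pushes to a uniformly random other node.
        edges = []
        for v in range(n):
            u = rng.randrange(n - 1)
            if u >= v:
                u += 1
            edges.append((v, u))
        # Keep each edge independently with probability 1 - p to obtain E'_t.
        kept = [e for e in edges if rng.random() >= p]
        # S_{t+1}: nodes already informed plus receivers of kept edges from S_t.
        informed |= {u for (v, u) in kept if v in informed}
        t += 1
    return t
```

Averaging `flooding_time(n, 1 / q, {v}, rng)` over many runs gives an empirical view of the tail behavior that the throughput parameter is meant to capture.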

Note that Definition 3 is only well defined if $M$ is oblivious.
Furthermore, $F$ becomes a time-homogeneous Markov chain
if $M$ is an i.i.d. oblivious network model. Also, as long as
for every time $t$ the union over the edges in $M$ from $t$ to
infinity is almost surely connected, $F$ is absorbing with
the unique absorbing state $V$. We say the flooding process
$F$ stops if it reaches this absorbing state and we denote the
time this happens by the random variable $S_F$. The next
definition pairs this flooding time with a throughput parameter
$\alpha$ that corresponds to the exponent of the flooding process tail
probability. The reason for this definition and its connection
to the multi-message throughput behavior of network coding
becomes apparent in the statement and proof of Theorem 1
below.

Definition 4. We say an oblivious network model $M$ on a
node set $V$ floods in time $T$ with throughput $\alpha$ if there exists
a prime power $q$ such that for every $v \in V$ and every $k > 0$
we have $P[S_{F(M,1/q,\{v\})} \ge T + k] < q^{-\alpha k}$.

To give a few illustrating examples of flooding times we 
note that the random phone call network model on n nodes 
floods in $O(\log n)$ time with constant throughput. The uniform
gossip model on a connected degree bounded graph $G$ floods
in time $\Theta(D)$ and with constant throughput where $D$ is the
diameter of $G$. In many oblivious network models it is easy
to give tight bounds on the flooding time and throughput. 

With this framework for oblivious network models in place 
we can give a cleaner restatement of Theorem 3 in [12]. We 
also provide a sketch of the proof since we will later expand 
on the ideas used therein. 

Theorem 1 (Theorem 3 of [12]). Suppose $M$ is an oblivious
network model that floods in time $T$ with throughput $\alpha$. Then,
for any $k$, random linear network coding in the network model
$M$ spreads $k$ arbitrarily distributed messages to all nodes with
probability $1 - \epsilon$ after $T' = T + \frac{1}{\alpha}(k + \log \epsilon^{-1})$ rounds.

Proof: The random linear network coding protocol we
analyze will use the same field size $q$ that achieves the parameters
$T$ and $\alpha$ for $M$ in Definition 4. We fix a coefficient vector
$\mu \in \mathbb{F}_q^k$. This vector is initially known to a non-empty subset
$S$ of nodes. It is easy to check that the probability that a node
$v$ does not know $\mu$ after it has received a packet from a node
that knows $\mu$ is at most $1/q$. This implies that knowledge
of $\mu$ spreads through the network exactly as the flooding
process $F(M,1/q,S)$. Using the assumption, Definition 4 and the
monotonicity of $S_{F(M,1/q,S)}$ in $S$ we get that the probability
that a vector $\mu \in \mathbb{F}_q^k$ is not known to all nodes after $T'$ steps
is at most $q^{-(k+\log \epsilon^{-1})}$. A union bound over all $q^k$ vectors
shows that with probability at least $1 - q^k q^{-(k+\log \epsilon^{-1})} \ge 1 - \epsilon$ all
nodes will know all vectors, and it is easy to see that this
implies that all nodes are able to decode the messages. ■
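
The protocol analyzed in Theorem 1 can be simulated at the level of coefficient vectors. The sketch below is a minimal illustration under several assumptions that are ours: the random phone call PUSH model, a prime field size $q$, message $i$ initially placed at node $i \bmod n$, and hypothetical function names. Only coefficient vectors are tracked, not payloads.

```python
import random


def reduce_and_insert(basis, vec, q):
    """Insert vec into an echelon basis over F_q (q prime) if independent.

    basis maps pivot column -> normalized row. Returns True iff the rank grew.
    """
    vec = [x % q for x in vec]
    for col in range(len(vec)):
        if vec[col] == 0:
            continue
        if col in basis:
            f = vec[col]
            vec = [(x - f * y) % q for x, y in zip(vec, basis[col])]
        else:
            inv = pow(vec[col], -1, q)  # modular inverse (Python 3.8+)
            basis[col] = [(x * inv) % q for x in vec]
            return True
    return False


def random_combination(basis, k, q, rng):
    """Uniformly random linear combination of the stored vectors."""
    out = [0] * k
    for row in basis.values():
        a = rng.randrange(q)
        out = [(x + a * y) % q for x, y in zip(out, row)]
    return out


def algebraic_gossip_rounds(n, k, q, rng, max_rounds=10_000):
    """Rounds until every node has rank k, or None if max_rounds is exceeded."""
    bases = [dict() for _ in range(n)]
    for i in range(k):
        unit = [1 if j == i else 0 for j in range(k)]
        reduce_and_insert(bases[i % n], unit, q)
    for t in range(1, max_rounds + 1):
        # All packets of a round are formed before any of them is delivered.
        packets = [random_combination(b, k, q, rng) for b in bases]
        for v in range(n):
            u = rng.randrange(n - 1)
            if u >= v:
                u += 1
            reduce_and_insert(bases[u], packets[v], q)
        if all(len(b) == k for b in bases):
            return t
    return None
```

Comparing the returned round counts for growing $k$ against $T + \frac{1}{\alpha}k$ gives a rough empirical check of the linear dependence on the number of messages.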

IV. Simple, Direct Proofs for Tight Stopping Times 

In this section we give a simple, direct derivation of tight 
stopping time bounds for gossip with one source and side 
information at the nodes and gossip with two correlated 
sources. Our two main results in this section are the following. 

Theorem 2. Suppose $M$ is an oblivious network model that
floods in time $T$ with throughput $\alpha$. We assume a single message
$\mathbf{x}$ generated from $X$ and side information $\mathbf{y}_v$ generated
from $Y_v$ at every node $v$. Fix an error probability $\epsilon > 0$. Then,
for any $\delta > 0$, for any large enough block length $l$ and any
packet size $s$, node $v$ will correctly decode $\mathbf{x}$ with probability at
least $1 - \epsilon$ after $T' = T + \frac{1}{\alpha}(\lceil \frac{l}{s}(H(X|Y_v)+\delta) \rceil + \log \epsilon^{-1} + 3)$
rounds.

Theorem 3. Suppose $M$ is an oblivious network model
that floods in time $T$ with throughput $\alpha$. We assume two
messages $\mathbf{x}_1, \mathbf{x}_2$ are generated from $X_1, X_2$ and nodes have
no side information. Fix an error probability $\epsilon > 0$. Then,
for any $\delta > 0$ and for large enough $l$, with probability
at least $1 - \epsilon$ every node will correctly decode $\mathbf{x}_1, \mathbf{x}_2$ after
$T + \frac{1}{\alpha}(\lceil \frac{l}{s}(H(X_1,X_2)+2\delta) \rceil + \log \epsilon^{-1} + 3)$ rounds.

The idea for proving these theorems is to generalize the
observation of [12] that the question of when a node can
decode is equivalent to determining when this node knows
(see Definition 1) enough coefficient vectors. The proof of
Theorem 1 shows that flooding or spreading of knowledge of
vectors can be analyzed using a union bound. This implies
that only the number of vectors needed is of importance. In
the case with uncorrelated sources and no side information
essentially knowledge of all coefficient vectors is necessary.
In the correlated scenario, however, we want to relate the
number of vectors a node $v$ needs to know to the conditional
entropy $H(X|Y_v)$. Lemma 1 helps in this respect. It asserts
that in order to decode, a node with side information $Y$ does
not necessarily need $i = \lceil (H(X|Y)+\delta) l \rceil$ specific bits, but
rather, assuming joint typicality decoding, it requires only
any sufficient amount of information about the index of the
bin in which $\mathbf{x}$ resides. This is achieved by any $i/s$ packets
containing independent equations on the bin index. We can
thus focus on the knowledge a node is required to obtain in
order for its coefficient matrix to have rank at least $i/s$.

Unfortunately it is possible that a node knows many vectors
without having a large rank. In fact, upon reception of the first
packet a node gets to know all but a $1/q$ fraction of all vectors.
On the other hand, in order to prove faster stopping times we
want to argue that the knowledge of only an exponentially
small fraction of all vectors suffices for decoding. This is
achieved by the following lemma, which shows that indeed
a small set of specific coefficient vectors suffices to guarantee
that sufficiently many independent coefficient vectors were received:

Lemma 2. Let $V$ be a finite dimensional vector space over
a finite field $\mathbb{F}_q$. For every $0 \le h < \dim V$ there exist
$w = \frac{q^{h+1}-q}{q-1} + 1$ vectors $v_1, \ldots, v_w \in V$ such that any subspace
$K \subseteq V$ for which $K^{\perp}$ does not contain $v_i$ for any $i$ has
dimension at least $h + 1$.
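
A statement of this kind can be checked by brute force in small cases. The sketch below assumes one particular construction, which is our assumption and not necessarily the authors' proof: take one representative of every one-dimensional subspace of the span of the first $h+1$ unit vectors. It then verifies that every subspace spanned by at most $h$ vectors has one of these representatives in its orthogonal complement, which is the contrapositive of the lemma. The field size $q$ is assumed prime and all function names are hypothetical.

```python
import itertools


def projective_points(q, h, d):
    """One representative per 1-dim subspace of span(e_1, ..., e_{h+1}) in F_q^d."""
    pts = []
    for v in itertools.product(range(q), repeat=h + 1):
        first = next((x for x in v if x), None)
        if first == 1:  # normalized: first nonzero coordinate equals 1
            pts.append(v + (0,) * (d - h - 1))
    return pts


def orthogonal(u, v, q):
    """True iff the standard inner product of u and v vanishes over F_q."""
    return sum(a * b for a, b in zip(u, v)) % q == 0


def check_lemma(q, h, d):
    """Check that every K spanned by at most h vectors of F_q^d has some
    chosen vector in its orthogonal complement."""
    pts = projective_points(q, h, d)
    all_vectors = list(itertools.product(range(q), repeat=d))
    for gens in itertools.product(all_vectors, repeat=h):
        if not any(all(orthogonal(g, v, q) for g in gens) for v in pts):
            return False
    return True
```

The check enumerates generator tuples rather than subspaces, which is redundant but keeps the code short; it is only practical for very small $q$, $h$ and $d$.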

It is now possible to prove the two main results of this
section. Their proofs are self-contained and involve only
random binning (Lemma 1) and Lemma 2:

Proof Sketch of Theorem 2: We use the field size
$q$ that achieves the parameters $T$ and $\alpha$ in Definition 4.
We furthermore choose $l$ large enough so that the decoding
error probability in Lemma 1 is at most $\epsilon/2$. By Lemma 1 any
node $v$ with access to the side information vector $\mathbf{y}_v$ and
$\lceil \frac{l}{s}(H(X|Y_v)+\delta) \rceil$ independent equations on the blocks
describing the bin index of $\mathbf{x}$ assigned by the random binning
procedure can decode $\mathbf{x}$ with probability at least $1 - \epsilon/2$.
It thus remains to show that with probability $1 - \epsilon/2$ we
have $\dim S_v \ge \lceil \frac{l}{s}(H(X|Y_v)+\delta) \rceil$ after $T'$ rounds. To prove
this we apply Lemma 2 and get that there exists a set $Z$
of at most $2q^{\lceil \frac{l}{s}(H(X|Y_v)+\delta) \rceil}$ coefficient vectors such that if $v$ has
knowledge of these vectors, it indeed has sufficiently many
independent equations. Furthermore, we refer to the proof of
Theorem 1 for the fact that knowledge of any coefficient vector
(in $Z$) spreads through the network like a flooding process. As
before we thus get that in the assumed network model
$M$ the probability that any given coefficient vector (in $Z$) is
not known after $T + \frac{1}{\alpha}(k + (\log \epsilon^{-1} + 1))$ rounds is smaller
than $\epsilon/2 \cdot q^{-k}$. Setting $k = \log |Z|$ and using a union bound
over all coefficient vectors in $Z$ we get as a result that indeed
after $T'$ rounds the probability that $v$ has received sufficiently
many independent coefficient vectors is at least $1 - \epsilon/2$. ■

While the proof of Theorem 3 is similar in nature to that of
Theorem 2, there is a delicate point when considering multiple
sources. In a single source scenario, for each terminal there is
only one inequality governing the rate, namely $r \ge H(X|Y_v) + \delta$.
Using Lemma 2 this rate constraint is translated into a rank
constraint, namely, $\dim(S_v) \ge \lceil \frac{l}{s}(H(X|Y_v)+\delta) \rceil$. For more
than one source, however, the rate region is given by multiple
rate constraints, and one has to make sure all are satisfied.
Indeed, for two sources and no side information at the nodes
this can be done using a single rank constraint. For more than
two sources, or when additional side information is available,
a more refined analysis is required. This is the subject of the
next section.

V. Characterizing Capacities in Oblivious 
Network Models 

To date, analysis of gossip schemes focused only on the 
dissemination time - the number of rounds required to gain the 
complete knowledge in the network. However, especially when 
dynamic networks are analyzed, it is interesting to gain a more 
accurate measure of the actual capacities achievable between 
sets of nodes. Namely, to analyze the capacities induced by 
the packet exchange process in algebraic gossip. This is an 
interesting question in its own right, and, in particular, can give 
a "black-box" method to transfer any results of prior works 
that bound the rates or capacities needed between sources and 
sinks in the static memory-free setting to stopping times in 
oblivious network models. 

Herein, we develop such a characterization, and apply it to 
the results of [4] and [6] to obtain stopping times for gossip
protocols with an arbitrary number of correlated sources and 
side-information, generalizing the results from the last section. 
We first introduce the required notation. 

Definition 5. Let $T > 0$, node set $V$ and active edges
$E_1, E_2, \ldots, E_T$ be given. We define a path $P$ from $s$ to $d$
to be a sequence of nodes $P = (v_0, v_1, \ldots, v_T)$ such that
$v_0 = s$, $v_T = d$ and for every $1 \le t \le T$ we have $v_{t-1} = v_t$
or $(v_{t-1}, v_t) \in E_t$. We furthermore define a set of $m$ paths
$P_1, \ldots, P_m$ with weights $w_1, \ldots, w_m$ to be valid if for every
$t \le T$ and every $(u,v) \in E_t$ the weights of paths using $(u,v)$
at time $t$ sum to at most one. Lastly, we say a set of valid weighted
paths achieves a capacity of $c$ between two nodes $s$ and $d$ if
the weights of paths from $s$ to $d$ sum up to $c$.
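
The notions in Definition 5 translate directly into a small checker. In the sketch below a path is a tuple of $T+1$ nodes and the active edges are given as a list of sets of directed pairs; this representation, the floating-point tolerance `tol` and the function names are assumptions made for this example. A step in which a path stays at the same node is treated as using no edge.

```python
from collections import defaultdict


def is_path(path, edge_sets):
    """path = (v_0, ..., v_T); edge_sets[t - 1] is the active edge set of time t."""
    if len(path) != len(edge_sets) + 1:
        return False
    return all(
        path[t - 1] == path[t] or (path[t - 1], path[t]) in edge_sets[t - 1]
        for t in range(1, len(path))
    )


def is_valid(paths, weights, edge_sets, tol=1e-9):
    """Weights of paths using the same edge at the same time sum to at most one."""
    load = defaultdict(float)
    for path, w in zip(paths, weights):
        if not is_path(path, edge_sets):
            return False
        for t in range(1, len(path)):
            if path[t - 1] != path[t]:  # waiting at a node uses no edge
                load[(t, path[t - 1], path[t])] += w
    return all(x <= 1 + tol for x in load.values())


def capacity(paths, weights, s, d):
    """Total weight of the paths that start at s and end at d."""
    return sum(w for p, w in zip(paths, weights) if p[0] == s and p[-1] == d)
```

For example, two copies of the same path with weight $1/2$ each form a valid set achieving capacity one, whereas weights $3/4$ and $1/2$ overload the shared edges.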

Quite intuitively these paths correspond to an information
flow through the network from the sources to the sink. This
intuition can be made formal and one can give an explicit
equivalence between the algebraic gossip protocol and random
linear network coding in the classical memory-free model
(e.g., [4]). This was done in [17], which also describes the
hypergraph $G_{PNC}$ that corresponds to a sequence of active
edges. We omit the details of this equivalence and instead
only recall the facts needed in this paper:

Lemma 3. Let node set $V$, the active edges $E_1, \ldots, E_T$,
destination $d \in V$ and sources $s_1, s_2, \ldots, s_k \in V$ be given.
Algebraic gossip on $\{E_t\}_{t=1}^{T}$ is equivalent to classical random
linear network coding in the transformed hypergraph $G_{PNC}$
described in [17]. In particular, if for some integers $c_1, \ldots, c_k$
it is possible for every $s_i$ to transmit $c_i$ packets to $d$, then there
exists a set of valid paths of weight one that achieves a capacity of
$c_i$ between $s_i$ and $d$ for every $i$. Conversely, if for some positive reals
$c_1, \ldots, c_k$ there is a set of valid weighted paths that achieve
a capacity $c_i$ between $s_i$ and $d$, then the capacities $c_i$ lie in
the min-cut region of $G_{PNC}$.

Given this setup we show the first result in this direction: 



Lemma 4. Let $M$ be a network model on a node set $V$
that floods in time $T$ with throughput $\alpha$. For any $T'$, any
$\epsilon > 0$, any destination $d \in V$ and any set of $k$ source nodes
$s_1, s_2, \ldots, s_k \in V$ with integral capacities $c_1, c_2, \ldots, c_k \ge 1$,
suppose $E_1, \ldots, E_{T'}$ is a sequence of active edges on $V$
sampled from $M$. If $T' \ge T + \frac{1}{\alpha}(\sum_i c_i + \log \epsilon^{-1})$ then with
probability at least $1 - \epsilon$ there exists a selection of valid
weighted paths that achieve the capacity $c_i$ between $s_i$ and
$d$ for every $i$.

Proof: We think of putting $c_i$ messages at node $s_i$ and
run the standard algebraic gossip protocol for $T'$ rounds using
the field size $q$ that achieves the parameters $T$ and $\alpha$ on $M$.
Theorem 1 now shows that with probability $1 - \epsilon$ the sink $d$
can decode the messages. According to Lemma 3 this shows
that there are mutually disjoint paths from $s_i$ to $d$ for every
$i$ with weight one which achieve the desired capacities. ■

Note that the above lemma requires the capacities to be 
integral and thus essentially asks for the time until a certain 
number of mutually disjoint paths occur. While this is suffi- 
cient and optimal in the uncorrelated information spreading 
setting this requirement can be a severe restriction. 

One setting where this makes a drastic difference is when 
we have k sources and the total capacities needed sum up to 
less than one. This corresponds to asking for the time until
there is a path from each of the sources to the sink - without 
these paths having to be disjoint. If one considers for example 
the random phone call model with n nodes and k sources it 
takes in expectation log n+k time until a disjoint path between 
a node and each source appears while merely log n + log k 
rounds are sufficient to get this for non-disjoint paths. 

The following lemma generalizes this observation and
strengthens Lemma 4 in this direction to give order optimal
bounds for any set of fractional capacities:

Lemma 5. Let $M$ be a network model on a node set $V$ that
floods in time $T$ with throughput $\alpha$. For any $T'$, any $\epsilon > 0$, any
sink $d \in V$ and any set of $k$ source nodes $s_1, s_2, \ldots, s_k \in V$
with rates $c_1, c_2, \ldots, c_k > 0$, if $T' \ge T + \frac{1}{\alpha}(\lceil \sum_i c_i \rceil + \log k + \log \epsilon^{-1})$
then with probability at least $1 - \epsilon$ there exists a
selection of valid weighted paths that achieve a capacity of $c_i$
between $s_i$ and $d$ for every $i$.

Proof: The idea is to combine $k$ applications of Lemma 4
using a union bound and capacity sharing. We set the failure
probability to $\epsilon/k$ and in the $i$th application of Lemma 4 we
set $c_i$ to $\lceil \sum_j c_j \rceil$ while setting all other capacities to zero.
Out of this we get that for every $i$ with probability $1 - \epsilon/k$ the
number of disjoint paths from $s_i$ to $d$ is at least $\lceil \sum_j c_j \rceil$. Via
a union bound we get that with probability $1 - \epsilon$ all these
paths are there. Note that while the paths from each source
are disjoint, the paths starting at different sources may not be
disjoint. We now take the union of these paths while choosing
the weight of a path starting at source $s_i$ to be $\frac{c_i}{\lceil \sum_j c_j \rceil}$. This
gives a capacity of $c_i$ between $s_i$ and $d$. Furthermore, since any
edge $e$ is used by at most one path going out from each source,
we get that the total weight on $e$ summed over all paths is at
most $\sum_i \frac{c_i}{\lceil \sum_j c_j \rceil} \le 1$. ■


We will use Lemma 5 to prove our main result about
information dissemination with correlated data in oblivious 
networks. To state our result we need the following definition: 

Definition 6 (Slepian-Wolf region [1]). A capacity vector $c =
(c_1, \ldots, c_k)$ is sufficient for $v \in V$ if and only if for every
index subset $S \subseteq [k]$ we have $\sum_{i \in S} c_i \ge H(X_S \mid X_{[k] \setminus S}, Y_v)$.

Putting together Lemma 3 and Lemma 5 and applying the
results on network coding with correlated data from [4] and
[6] in a black-box manner, we can now directly state our
main result, which generalizes and encompasses Theorem 2
and Theorem 3:

Theorem 4. Suppose $M$ is an oblivious network model that
floods in time $T$ with throughput $\alpha$. Then, for any $\delta > 0$
and error probability $\epsilon > 0$, there exists an $l$ such that for any
joint distribution of $X_1, \ldots, X_k$ and the $Y_v$'s, any packet size
$s > 0$, any node $v$ and any capacity vector $(c_1, \ldots, c_k)$ that
is sufficient for $v$, node $v$ will correctly decode $\mathbf{x}_1, \ldots, \mathbf{x}_k$
with probability at least 1 — e after T + ^( | + <J + 

log k + log e^^ + (5) rounds. 